I Introduction & Motivation

Figure 1

Due to the climate change and rising temperature, the environment and all living beings have suffered tremendously from wildfire, which becomes more frequent and severe in recent decade. Especially in California, over the years, rampant wildfires have plagued the state of California, creating economic and environmental loss. In 2018, wildfires claimed more than 100 lives in California, over 1.6 million acres of land has burned and caused large sums of environmental damage. Although the government have paid much attention to this issue, and many efforts have been made, the situation is not obviously changed. Last year (2020), there are still over 4.2 million acres of land has burned by around 10,000 fire incidents and more than 10,000 structures damage.

Not only the physical damages, but wildfires also bring great financial losses. California’s one of the largest wildfires happened in 2018, which cost $148.5 billion (roughly 1.5% of California’s annual gross domestic product). The financial cost includes capital losses, health costs, indirect losses, and the money that state spends every year to fund firefighting through the California Department of Forestry and Fire Protection.

Currently, there are many applications and websites that provide people with information about the wildfire prediction. But there is a little focus on the financial analysis, without a broad and comprehensive evaluation of wildfire risks and various risk mitigation strategies, it will be difficult to efficiently and effectively allocate additional funding related to wildfires. For example, in the absence of high‑quality information about the cost‑effectiveness of different types of risk reduction and response efforts, the state might not effectively balance funding for prevention and mitigation with funding for response capacity. Our project, unlike existing projects, combines wildfire risk and financial cost to help the government can be aware that to decrease the probability of wildfire occurrence, at least how much funding they should apply for.

Specifically, this project builds a Logistic Regression model to predict probability occurrence rate of wildfires in part of Northern California by integrating various data sources and we take 5 years’ records into consideration. Model is validated across 100 folders and spatial patterns. At the end we have a cost-benefit analysis to optimize the threshold on which we can have the minimum cost. The project offers insights on potential wildfire risk and effective funding allocation for California Department of Forestry and Fire Protection.

Click here to access the Fire Manager introduction video.

II Set up

To begin with, I loaded libraries that are needed in this analysis, standardized the formatting of maps and plots as well as identify functions for further distance analysis.

III Fishnet creating & feature engineering

3.1 Data Loading

This project includes 9 datasets in total.

  • Wildfire records: fire perimeters dataset is an ESRI ArcGIS file geodatabase with three data layers: – A layer depicting wildfire perimeters from contributing agencies current as of the previous fire year; – A layer depicting prescribed fires supplied from contributing agencies current as of the previous fire year; – A layer representing non-prescribed fire fuel reduction projects that were initially included in the database. Fuels reduction projects that are non prescribed fire are no longer included. We use this fundamental dataset to count the incidents happened in the past 5 years(2015-2019) as well as the situation in 2020.

  • Study area boundary: Based on the historical wildfire cases in the past 6 years (2015-2020), we focus on North California area, where had more severe situation in terms of wildfires, in this project. Specifically, we filter 25 counties from the whole boundary in our application: Del Norte, Siskiyou, Humboldt, Trinity, Modoc, Shasta, Lassen, Plumas, Tehama, Mendocino, Glenn, Lake, Colusa, Butte, Sutter, Yuba, Sierra, Nevada, Placer, El Dorado, Sacramento, Sonoma, Napa, Yolo, Solano. Figure 3.1-1

  • Slope in the area: We use ArcGIS Pro to calculate the slope from 90m DEM of California.

  • Aspect in the area: We use ArcGIS Pro to calculate the aspect form 90m DEM of California.

  • Elevation in the area: We use ArcGIS Pro to calculate the Elevation from 90m DEM of California.

  • Land cover: A shapefile contains information of land use and land cover in California. According to the study area, Modoc land cover, Sacramento land cover, Sierra Nevada land cover, Bay Delta land cover, and North Coast land cover are used in this project.

  • Aspects of the wildlife urban interface (WUI): This data depicts different aspects of wildland urban interface, which indicates by Hazard Type:not FHSZ, moderate, high, very high.

  • Weather: Using the riem package in R to get the weather information including temperature, perticipation, humidity, and wind speed (More details about the weather processing will discuss in later section).

  • Powerline: California Electric Transmission Lines is a Geojson dataset. The reason we consider the influence of powerline in our model is because it can create a micro weather, so combing with weather data, it is effective in correlating the wildfire ignition conditions (powerline related fires usually occur during the windy weather).

# California Counties
ca_counties <- st_read("https://opendata.arcgis.com/datasets/a61c138d0a6946da8d1ebb8d1c9db13a_0.geojson")%>%
  st_transform('EPSG:2225')


# Fire Perimeter
fire <- st_read("https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_Fire_Perimeters/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")%>%
  st_transform('EPSG:2225')


# fire cases in 2020
fire_2020 <- fire %>% filter(YEAR_=="2020")

# Zoom in the study area (25 counties)
N_California <- st_read("https://opendata.arcgis.com/datasets/a61c138d0a6946da8d1ebb8d1c9db13a_0.geojson") %>%
  filter(COUNTY_NAME == 'Del Norte' | COUNTY_NAME == 'Siskiyou' | COUNTY_NAME == 'Humboldt'| COUNTY_NAME == 'Trinity'| COUNTY_NAME == 'Modoc'| COUNTY_NAME == 'Shasta'| COUNTY_NAME == 'Lassen'|COUNTY_NAME == 'Plumas' | COUNTY_NAME == 'Tehama' | COUNTY_NAME == 'Mendocino' | COUNTY_NAME == 'Glenn' |COUNTY_NAME == 'Lake' | COUNTY_NAME == 'Colusa' | COUNTY_NAME == 'Butte' |COUNTY_NAME == 'Sutter'|COUNTY_NAME == 'Yuba' |COUNTY_NAME == 'Sierra' |COUNTY_NAME == 'Nevada' |COUNTY_NAME == 'Placer' |COUNTY_NAME == 'El Dorado' |COUNTY_NAME == 'Sacramento'| COUNTY_NAME == 'Sonoma' |COUNTY_NAME == 'Napa'| COUNTY_NAME == 'Yolo'|COUNTY_NAME == 'Solano') %>%
  st_transform('EPSG:2225')

# Filter Northern California fire
N_fire2020 <- st_join(fire_2020, N_California, join = st_intersects,left = FALSE)

# convert the date format -- fire records in 2020 are from Feb. 14 to DEC.30 (for weather processing later)
msValue1 = N_fire2020$ALARM_DATE
N_fire2020 <- N_fire2020 %>%
  mutate(ALARM_DATE = as.character(as.POSIXct(msValue1/1000, origin="1970-01-01")))

msValue2 = N_fire2020$CONT_DATE
N_fire2020 <- N_fire2020 %>%
  mutate(CONT_DATE = as.character(as.POSIXct(msValue2/1000, origin="1970-01-01")))

## Wildfire in North California Boundary
ggplot() +
  geom_sf(data = N_fire2020, fill="red", color="transparent")+
  geom_sf(data=N_California, fill="transparent", color="grey", size=0.5)+
  labs(title="North California Wildfires2020",
       subtitle="for study area",
       caption="Figure 3.1-2")+
  mapTheme()

3.2 Create Fishnet of Northen California wildfires

Since there are a great amount of data need to process, and most of the terrain data are in raster form, we do the initial fishnet creating and data wrangling in ArcGIS Pro.

First, we created a 3 x 3 mile fishnet for the study area. We chose this size because .25 square miles is approximately the size of a residential neighborhood (Coulton et al., 2013), however considering 0.5 x 0.5 mile unit is too small for our study area as well as R will be overloaded to process it, then we take multiples of 0.5 and make sure the unit is appropriate for the study area, thus 3 x 3 mile, 7831 units in total. Figure 3.2-1 Then, we get the slope shapefile by geoprocessing the 90m DEM data and use the conversion tool convert raster to polygon. Figure 3.2-2 Figure 3.2-3 The same process are used to get the aspect and elevation shapefile.

The last step is join all the data we process in GIS to the each fishnet unit accordingly. Here, we use centroid to locate the fishnet, if the centroid of the unit is within certain spatial features, then this unit has the same character. Therefore, in Spatial Join we connect the fishnet point layer to feature layers, and join them one to many to ensure each unit has only one value without duplicates. Since we already clipped the study area from the whole California boundary, here we delete the nullvalue in the fishnet, which means the left data will be exactly for those 25 counties.

Figure 3.2-4 Figure 3.2-5 use land cover as an example Other features are joined by the same process, and the fishnet are created as shown below. Figure 3.2-6 After importing all_data fishnet into R, we then continue to complete the dataset for analysis. First, the wildfire records are connect to the fishnet, since we use logistic regression model with binary dependent variable: fire or no fire, we mutate the 0 record to NO_FIRE, other numbers are all classified as FIRE.

# change the column name to be more clear and select useful ones
all_data <- read_sf('data\\all_data')%>%
  st_transform('EPSG:2225')%>%
  mutate(uniqueID = JOIN_FID_1 + 1,
         "County"= NAME, 
         "HAZ_type" = HAZ_DESC, 
         "Vegetation" = WHR13NAME1, 
         "Aspect" = gridcode, 
         "Elevation" = GRIDCODE_1, 
         "Slope" = gridcode11)

fishnet <- all_data%>%
  select(uniqueID,County,HAZ_type,Vegetation,Aspect,Elevation,Slope,geometry)

# count the number of wildfire records in each fishnet
N_fire1 <- N_fire2020%>%
  dplyr::select(geometry)%>%
  mutate(countfire = 1)

fire_net <- st_join(st_centroid(fishnet), N_fire1, join =
                      st_intersects,left = FALSE)%>%
  st_drop_geometry()

fire_net2020 <- left_join(fishnet, fire_net, by = "uniqueID")%>%
  dplyr::select(uniqueID, County.x, HAZ_type.x, Aspect.x,
                Elevation.x, Slope.x, geometry, countfire)

fire_net2020[is.na(fire_net2020)] = 0

# mutate the result to binary
firenet2020 <- fire_net2020 %>% 
  group_by(uniqueID) %>% 
  summarise(counts = sum(countfire))
  firenet20 <- firenet2020 %>% mutate(
    fire = ifelse(counts =="0","NO_FIRE","FIRE"),
    "uniqueID" = fishnet$uniqueID,
    "County" = fishnet$County, 
    "HAZ_type" = fishnet$HAZ_type, 
    "Vegetation" = fishnet$Vegetation, 
    "Aspect" = fishnet$Aspect, 
    "Elevation" = fishnet$Elevation, 
    "Slope" = fishnet$Slope, 
    "geometry" = fishnet$geometry )

ggplot()+
  geom_sf(data = firenet20, aes(fill=fire), color=NA)+
  scale_fill_manual(values = c("red","grey"))+
  labs(title="North California Wildfires in fishnet 2020",
       subtitle="for study area",
       caption="Figure 3.2-7")+
  mapTheme()

Then we add powerline data, historical records and weather data in the fishnet:

For powerline, we spatial join the powerline and fishnet, null values were replaced with zero as the powerlines were not going through those grids, the outcome would be powerline and no_powerline.

## 1. Powerline
e_stransmission<-
  st_read("data//California_Electric_Transmission_Lines.geojson") %>%
  st_transform('EPSG:2225')%>%
  st_transform(st_crs(fire_net2020)) %>%
  dplyr::select(geometry)

e_net <- st_join(fishnet, e_stransmission, join =
                   st_intersects,left = FALSE)%>%
  mutate(e = 1)%>%
  group_by(uniqueID) %>% 
  summarise(e_number = sum(e))

e_net <- e_net%>%st_drop_geometry()
e_net1 <- left_join(fishnet, e_net, by = "uniqueID")
e_net1[is.na(e_net1)] = 0

Enet <- e_net1 %>% 
  mutate(e_number = ifelse(e_number == "0", "no_powerline",
                           "powerline"))%>%
  rename("Powerline" = e_number)%>%
  select(uniqueID, Powerline)%>%
  st_drop_geometry()

For historical records, we collect 5 years records from 2015 to 2019, the process is the same as we do for 2020 data, except for the historical data we remain the counts as one feature.

## 2. Historical Fire (area)
fire_perimeter <- fire %>% filter(YEAR_=="2015"|YEAR_=="2016"|YEAR_=="2017"|YEAR_=="2018"|YEAR_=="2019")

fire_records <- st_join(fire_perimeter, N_California, join =
                          st_intersects,left = FALSE)%>%
  mutate(firerecord = 1)%>%
  dplyr::select(geometry,firerecord)

records_net <- st_join(st_centroid(fishnet), fire_records, join
                       = st_intersects,left = FALSE)%>%
  st_drop_geometry()

Historical_Fire <- left_join(fishnet, records_net, by =
                               "uniqueID")%>%
  dplyr::select(uniqueID, County.x, HAZ_type.x, Aspect.x,
                Elevation.x, Slope.x, geometry, firerecord)

Historical_Fire[is.na(Historical_Fire)] = 0

Historical_Fire <- Historical_Fire %>%
  group_by(uniqueID) %>% 
  summarise(firerecordcounts = sum(firerecord))  

Historical_Fire <- Historical_Fire %>% st_drop_geometry()

firenet20_his <- cbind(firenet20,Historical_Fire,Enet) %>%
  rename("his_counts" = firerecordcounts) %>%
  select(-uniqueID.1,-uniqueID.2)

For weather, we identified weather stations that were in and around our selected counties. We then associated each of these stations with our fishnet using the nearest neighbor function and retrieved weather data from each station for fire season months (February to December) in 2020.

# 3. Weather
## find the weather stations in California
riem_networks()
ca_stations = riem_stations("CA_ASOS")
ca_stations2 <- st_as_sf(ca_stations, coords = c("lon", "lat"), crs = 4326)

## identify the station in our area
weather_station_ids <- c("SIY", "CEC", "MHS", "O86", "ACV", "FOT", "RDD", "RBL", "CIC", "OVE","UKI", "MYV", "STS", "O69", "DVO", "APC", "SUU", "VCB", "EDU", "SMF", "LHM", "MYV", "AAT", "SVE", "PVF", "AUN","SAC", "MCC", "CIC","JAQ","MHR")

## df - stations with lat/lon and name info (in addition to ids)
asos_socal_stations <- riem_stations("CA_ASOS") %>%
  filter(str_detect(id, paste(weather_station_ids,collapse="|")))

asos_socal_stations$weather_station_id <- asos_socal_stations$id

asos_socal_stations <-  st_as_sf(asos_socal_stations, coords =
                                   c("lon","lat"), crs = 4326,
                                 agr = "constant") %>%
                        st_transform('EPSG:2225')

asos_socal_stations$weather_ID <- seq.int(nrow(asos_socal_stations))

## Finding closest station
weather_coords <- asos_socal_stations %>% select(geometry)

st_c <- st_coordinates
st_coid <- st_centroid

closest_weather_station_to_fishnet <- nn2(st_c(weather_coords), st_c(st_coid(fishnet)), k = 1)$nn.idx

firenet20_his$weather_ID <- closest_weather_station_to_fishnet

# Weather station plot
ggplot()+
  geom_sf(data = N_California,fill = "white")+
  geom_sf(data = weather_coords,color = "blue")+
  labs(title="Weather station in study area",
       subtitle = "bule dots denoted weather stations",
       caption="Figure 3.2-8")+
  mapTheme()

# create the function for weather feature
get_weather_features_by_station <- function(weather_station_ids, start_year, end_year){
  
  year_vec <- seq(start_year, end_year)
  i <- 1
  weather_data_list <- list()
  for(station_id in weather_station_ids){
    print(paste("Processing station", station_id))
    for(year in year_vec){
      start_date = paste0(year, "02-14")
      end_date = paste0(year, "12-30")
      weather_data <- riem_measures(station = station_id, date_start = start_date, date_end = end_date) %>% 
        dplyr::summarise(weather_station_id = station_id,
                         year = year,
                         Max_Temp = max(tmpf, na.rm = TRUE),
                         Mean_Temp = mean(tmpf, na.rm = TRUE),
                         Mean_Precipitation = mean(p01i, na.rm = TRUE),
                         Mean_Humidity = mean(relh, na.rm = TRUE),
                         Mean_Wind_Speed = mean(sknt, na.rm = TRUE),
        ) 
      weather_data_list[[i]] <- weather_data
      i <- i + 1
    }
  }
  
  do.call("rbind", weather_data_list) 
}

weather_data2020 <- get_weather_features_by_station(weather_station_ids, 2020, 2020) %>%
  rename(Max_Temp20 = Max_Temp,
         Mean_Temp20 = Mean_Temp,
         Mean_Precipitation20 = Mean_Precipitation,
         Mean_Humidity20 = Mean_Humidity,
         Mean_Wind_Speed20 = Mean_Wind_Speed)

weather_2020 <- left_join(weather_data2020, asos_socal_stations, on = 'weather_station_id') %>%
  select (-weather_station_id, -id, -name, -year, -geometry) %>%
  distinct() 

alldataset2020 <- left_join(firenet20_his, weather_2020, on = "weather_ID")

3.3 Feature engineering

Here, we mutated aspects from numeric into categories to be more clear. When build the model, we first remain all the category in land cover, in summary we found out its categories should be mutate into more concise and meaningful categories, so after engineering this feature, our McFadden improved by 0.1.

# Aspect
engineeredData <-
  alldataset2020 %>%
  mutate(Aspect = case_when(
    Aspect == -1 ~ "Flat",
    Aspect >= 0 & Aspect <= 22.5 ~"North",
    Aspect > 22.5 & Aspect <= 67.5 ~"Northeast",
    Aspect > 67.5 & Aspect <= 112.5 ~"East",
    Aspect > 112.5 & Aspect <= 157.5 ~"Southeast",
    Aspect > 157.5 & Aspect <= 202.5 ~"South",
    Aspect > 202.5 & Aspect <= 247.5 ~"Southwest",
    Aspect > 247.5 & Aspect <= 292.5 ~"West",
    Aspect > 292.5 & Aspect <= 337.5 ~"Northwest",
    Aspect > 337.5 & Aspect <= 360 ~"North"))

# land cover
engineeredData <-
  engineeredData %>%
    mutate(Vegetation = case_when(
    Vegetation == "Conifer Woodland" ~ "Conifer Woodland",
    Vegetation == "Hardwood Forest"  ~ "Hardwood Forest",
    Vegetation == "Shrub"  ~ "Shrub",
    Vegetation == "Agriculture"  ~ "Agriculture",
    Vegetation == "Wetland" |Vegetation == "Water"|Vegetation ==
      "Urban" |Vegetation == "Herbaceous"|Vegetation == "Hardwood Woodland" |Vegetation == "Desert Shrub"|Vegetation ==
      "Conifer Forest" |Vegetation == "Barren/Other"  ~ "Other"))

IV Exploratory Analysis

There are 9 features have the possibility that count for whether the area would have wildfire or not. These variables can be classified as continuous features and categorical features. Here, for better exploratory conclusion, we visualize each type separately in plots below. 538 of 7831 units in the dataset have wildfire.

4.1 Continuous features

As shown in figure 4.1-1, the correlation between continuous features and whether there is wildfire in the area, the relative obvious feature is the slope, the smoother the terrain, the less chance that the area would have a wildfire. However, the reason why these factors seem not to be directly correlated with the dependent variable is none of this factors can influence the possibility of wildfire, many of them can be seen as a booster, combing with other features would make a big difference in terms of wildfire occurrence. The same conclusion can be made from figure 4.1-2, from the different peak trend in this outcome of density, we can see that the correlation between slope and fire. This outcome suggests that prediction of wildfire is a comprehensive process, there is no any specific dominant factor that causes the wildfire, all things should be considered together. Since logistic regression is based on OLS regression, we further confirm there is no multicollinearity among our independent variables as figure 4.1.3 shows. Unsurprisingly, mean temperature and mean maximum temperature are highly positively correlated. The correlation matrix also shows that temperature is negatively correlated with humidity.

# outcomes on average
engineeredData %>% st_drop_geometry() %>%
  dplyr::select(fire, Elevation, Slope,Max_Temp20,
                Mean_Temp20,Mean_Humidity20,
                Mean_Precipitation20,Mean_Wind_Speed20) %>%
  gather(Variable, value, -fire) %>%
  ggplot(aes(fire, value, fill=fire)) + 
  geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
  facet_wrap(~Variable, scales = "free") +
  scale_fill_manual(values = palette2) +
  labs(x="Fire", y="Mean", 
       title = "Feature associations with the likelihood of
       Wildfire",
       subtitle = "(continous Features)",
  caption = 'Figure 4.1-1')+
  theme(legend.position = "none")

# outcomes in density
engineeredData %>% st_drop_geometry() %>%
  dplyr::select(fire, Elevation, Slope,Max_Temp20,
                Mean_Temp20,Mean_Humidity20,
                Mean_Precipitation20,Mean_Wind_Speed20) %>%
  gather(Variable, value, -fire) %>%
    ggplot() + 
    geom_density(aes(value, color=fire), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = c("#99CCCC","#FFCC99"),
                      name = 'FIRE') +
    labs(title = "Feature associations with the likelihood of Wildfire",
         subtitle = "(continous outcomes in density)",
         caption = 'Figure 4.1-2') +
    theme(legend.position = "none")

## correlation between continuous features
numericVars <- select_if(engineeredData , is.numeric) %>% na.omit() %>% st_drop_geometry() %>%
  dplyr::select(Elevation, Slope,Max_Temp20, Mean_Temp20,Mean_Humidity20,
                Mean_Precipitation20,Mean_Wind_Speed20)

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c('#05A167', "white", '#6897BB'),
  lab_size = 1,
  tl.cex = 8,
  insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1,lab = TRUE) +  
  labs(title = "Correlation across numeric variables\n",
       caption = 'Figure 4.1-3') +
  plotTheme()

4.2 Categorical features

We again visualize the correlation between categorical features and dependent variable. Instead of showing the total number of fire or no fire records of each category, we visualize the outcome by percentage: in fire records and on fire records, what percentage is each category. Figure 4.2-4 blow indicates that Most fires seem to occur in high hazard type area which is outside of the wildlife urban interface; Since after engineering, the type of land cover are from 12 to 5 types, type other actually contains 8 types. Thus, Fires seem especially prevelant in areas dominated by shrub land.

## HAZ_type
tips <- engineeredData %>%st_drop_geometry()%>%
    dplyr::select(fire,HAZ_type) 

## Vegetation
tips2 <- engineeredData %>%st_drop_geometry()%>%
    dplyr::select(fire,Vegetation) 

## Historical records
tips3 <- engineeredData %>%st_drop_geometry()%>%
    dplyr::select(fire,his_counts) 

grid.arrange(ncol = 1,
             ggplot(tips, aes(x= HAZ_type,  group=fire)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
    geom_text(aes( label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count", vjust = -.5) +
    scale_fill_manual(values = c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603'))+
    labs(y = "Percent", fill="HAZ_type",title = "Categorical features and Dependent variable") +
    facet_grid(~fire) +
    scale_y_continuous(labels = scales::percent),
    
    ggplot(tips2, aes(x= Vegetation,  group=fire)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
    geom_text(aes( label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count", vjust = -.5) +
    scale_fill_manual(values = c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603',"#FE9900"))+
    labs(y = "Percent", fill="Vegetation") +
    facet_grid(~fire) +
    scale_y_continuous(labels = scales::percent),
    
    ggplot(tips3, aes(x= his_counts,  group= fire)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
    geom_text(aes( label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count", vjust = -.5) +
    scale_fill_manual(values = c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603',"#FE9900"))+
    labs(y = "Percent", fill="his_counts",caption = 'Figure 4.2-4')+
    facet_grid(~fire) +
    scale_y_continuous(labels = scales::percent))

V Model building and Prediction

5.1 Create a logistic regression model

First, We randomly split the dataset into 65/35 training/test set.

set.seed(777)
engineeredData <-
  engineeredData %>% mutate(fire_numeric = ifelse(counts == "0", 0, 1))
trainIndex <- createDataPartition(engineeredData$fire, p = .65,
                                 y = paste(
                                   engineeredData$HAZ_type,
                                   engineeredData$Vegetation,
                                   engineeredData$Aspect,
                                   engineeredData$powerline),
                                  list = FALSE,
                                  times = 1)

fireTrain <- engineeredData[ trainIndex,] %>% st_drop_geometry()
fireTest  <- engineeredData[-trainIndex,] %>% st_drop_geometry()

Since we are more concerned about whether there will be wildfire or not, the independent variable in this model consists of two levels: FIRE and NO_FIRE to identify whether there is wildfire in the area. Therefore, we will build a Logistic Regression model. As discussed in the previous section, after feature engineering, current model perform better than the the first one we built concerning McFadden and POC curve.

# Model
fireModel <- glm(fire_numeric~ .,
                    data=fireTrain %>% 
                      dplyr::select(-uniqueID,-counts,-weather_ID, -fire),
                    family="binomial" (link="logit"))

stargazer(fireModel, type = "html", 
          title = "Table 5.1 Summary Statistics of fireModel",
          header = FALSE,
          single.row = TRUE,
          column.labels=c("fireModel"))
Table 5.1 Summary Statistics of fireModel
Dependent variable:
fire_numeric
fireModel
CountyButte 16.746 (2,283.129)
CountyColusa -1.346 (2,690.777)
CountyDel Norte 13.205 (2,283.129)
CountyEl Dorado 13.620 (2,283.129)
CountyGlenn 17.368 (2,283.129)
CountyHumboldt 13.599 (2,283.129)
CountyLake 16.164 (2,283.129)
CountyLassen 16.034 (2,283.129)
CountyMendocino 16.130 (2,283.129)
CountyModoc 15.058 (2,283.129)
CountyNapa 19.114 (2,283.129)
CountyNevada -2.405 (2,863.134)
CountyPlacer -2.128 (2,616.087)
CountyPlumas 15.730 (2,283.129)
CountySacramento 16.865 (2,283.129)
CountyShasta 13.709 (2,283.129)
CountySierra 14.521 (2,283.129)
CountySiskiyou 14.719 (2,283.129)
CountySolano 16.035 (2,283.129)
CountySonoma 17.022 (2,283.129)
CountySutter 1.152 (2,867.105)
CountyTehama 14.840 (2,283.129)
CountyTrinity 16.213 (2,283.129)
CountyYolo 19.297 (2,283.129)
CountyYuba -2.050 (2,999.410)
HAZ_typeModerate 0.576* (0.310)
HAZ_typeNot FHSZ -16.876 (552.209)
HAZ_typeVery High 1.239*** (0.246)
VegetationConifer Woodland 2.800*** (0.985)
VegetationHardwood Forest 2.309*** (0.834)
VegetationOther 1.777** (0.826)
VegetationShrub 2.891*** (0.835)
AspectFlat -13.944 (2,536.272)
AspectNorth -0.107 (0.549)
AspectNortheast -0.274 (0.697)
AspectNorthwest -0.469 (0.771)
AspectSouth -0.834 (0.768)
AspectSoutheast 0.371 (0.610)
AspectSouthwest -1.367 (0.884)
AspectWest -1.022 (0.877)
Elevation 0.001*** (0.0002)
Slope 0.053*** (0.010)
his_counts -0.608*** (0.113)
Powerlinepowerline -0.973*** (0.230)
Max_Temp20 0.050* (0.030)
Mean_Temp20 0.114*** (0.034)
Mean_Precipitation20 127.446*** (35.885)
Mean_Humidity20 0.071*** (0.017)
Mean_Wind_Speed20 0.505*** (0.080)
Constant -41.288 (2,283.132)
Observations 5,144
Log Likelihood -849.610
Akaike Inf. Crit. 1,799.221
Note: p<0.1; p<0.05; p<0.01

5.2 Wildfire probabilities prediction

The plots below show the distribution of predicted probabilities for FIRE and NO_ FIRE. The predicted probabilities for no fire cluster around 0. However, the predicted probabilities for areas that do have fire do not cluster around 1. This is because wildfire is a relatively rare event, so very few grid cells have a high probability of wildfire. Only 6% of our fishnet grid cells had wildfires during the time period under analysis.

#Distribution of Predicted Probabilities
testProbs <- data.frame(Outcome = as.factor(fireTest$fire_numeric),
                        Probs = predict(fireModel, fireTest, type= "response"))

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Fire", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome",
       caption = 'Figure 5.2-1') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

VI Model Evaluation and Validation

6.1 Pseudo R Squared

The McFadden pseudo r squared for this model is about 0.36, which has been improved compared to our initial model. While this number is not particularly meaningful on its own, the McFadden pseudo r squared results for different iterations of the model were compared. A higher McFadden pseudo r squared indicates better accuracy.

# Fit metrics
pR2(fireModel)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
##  -849.6103465 -1319.9798792   940.7390655     0.3563460     0.1671326 
##          r2CU 
##     0.4163433

6.2 ROC Curve for Model

Theoretically, the ROC curve indicates a TP vs. FP rate at one certain decision threshold, while AUC provides an aggregate measure of performance across all possible classification thresholds. Shown in the graph below, for every additional improvement in true positive rate, the model will make a greater proportion of false positive errors. The AUC of 0.9 indicates that the model has a high goodness of fit. However, we still need to be concerned about the issue of overfiting.

## This us a goodness of fit measure, 1 would be a perfect fit, .5 is a coin toss
auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.9045
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve",
       caption = 'Figure 6.2-1')

6.3 Good of Fit: Confusion Matrix

A “confusion matrix” for the threshold of 50% shows us the rate at which we got True Positives (aka Sensitivity), False Positives, True Negatives (aka Specificity) and False Negatives for that threshold.

In this case, True Positives: Predicted correctly the occurrence of wildfire. False Positives: Predicted incorrectly the occurrence of wildfire. true Negatives: Predicted correctly there is no fire. False Negatives: Predicted incorrectly there is no fire.

For the Fire Manager app, we are more interested in increasing the true positive rate than in decreasing the false positive rate. So far, the sensitivity of the model is 0.32.

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5, 1, 0)))

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

Confusion Matrix and Statistics

      Reference

Prediction 0 1 0 2475 126 1 26 60

           Accuracy : 0.9434              
             95% CI : (0.934, 0.9519)     
No Information Rate : 0.9308              
P-Value [Acc > NIR] : 0.004508            
                                          
              Kappa : 0.4156              
                                          

Mcnemar’s Test P-Value : 0.000000000000000975

        Sensitivity : 0.32258             
        Specificity : 0.98960             
     Pos Pred Value : 0.69767             
     Neg Pred Value : 0.95156             
         Prevalence : 0.06922             
     Detection Rate : 0.02233             

Detection Prevalence : 0.03201
Balanced Accuracy : 0.65609

   'Positive' Class : 1                   
                                          

6.4 Cross Validation on 100 k-folds

This section conducted cross-validation on the model. The outputs include the ROC, sensitivity and specificity across the 100 k-folds. Figure 6.4-1 below visualizes the area under the ROC curve, sensitivity, and specificity of the final model across folds. This model generalizes well concerning specificity. However, the generalizability of the model’s sensitivity should be improved.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(fire ~ ., data = engineeredData %>% st_drop_geometry() %>%
                 dplyr::select(-uniqueID,-counts,-weather_ID, -fire_numeric)%>%na.omit(),
               method="glm", family="binomial",
               metric="ROC", trControl = ctrl)

cvFit
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
  geom_histogram(bins=35, fill = "#FF006A") +
  facet_wrap(~metric) +
  geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
       subtitle = "Across-fold mean represented as dotted lines",
       caption = 'Figure 6.4-1') +
  plotTheme()

6.5 Spatial Cross Validation

In order to further test the generalization of our model, we carry out spatial cross validation. Using leave one out method and make counties as “spatial folders”, which allows us to see how well our model predicted across space.

The map below visualizes actual fires and predicted probabilities for each of the 25 counties. This shows that the model basically generalizes well in space, however, we seems to under predict the northeast part of the area. Considering in a long time, northeast of California suffered much less from wildfire than other parts in this area, it is reasonable to have a lower probability in such area. For counties that have more clustered wildfires, the model can well predict the high probability. Still, some counties have more areas with high predicted probability of fire where no fire actually occured. This suggests that more work could be done to improve the generalizability of the model across space.

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(fire_numeric ~ ., family = "binomial", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

reg.vars <- c("County", "HAZ_type", "Vegetation", "Aspect", "Elevation", "Slope", "his_counts",
              "Powerline", "weather_ID", "Max_Temp20", "Mean_Temp20", "Mean_Precipitation20",
              "Mean_Humidity20", "Mean_Wind_Speed20")


reg.spatialCV <- crossValidate(
  dataset = engineeredData,
  id = "County",
  dependentVariable = "fire_numeric",
  indVariables = reg.vars) 

reg.spatialcv <-
  reg.spatialCV %>%
  dplyr::select("cvID" = County, fire_numeric, Prediction, geometry)

ggplot() +
  geom_sf(data = reg.spatialcv, aes(fill = Prediction), color = "transparent")+
  geom_sf(data = N_fire2020, fill = "transparent", color = "red", size=.5)+
  labs(title="Predicted Probabilities and Actual Fires",
       subtitle="Red outline marks perimeter of actual fires",
       caption = 'Figure 6.6-1')+
  mapTheme()

VII Cost-Benefit

7.1 Equation

Because From the department’s points of view. Our target value would be cost instead of revenue, considering wildfire management is the government’s responsibility to spend money and time to protect citizens’ safety and the nature we are living in. In this calculation, we use fishnet(3 x3 mile) as price unit, and multiply by 640 to get the price for each fishnet grid.

  1. True Positive - Allocated funding for prescribed fire.

  2. True Negative - No prevention funding were allocated.

  3. False Positive - Allocated the funding for prescribed fire.

  4. False Negative - No prevention funding were allocated, and fire caused damage.

input <- c(
'True_Positive', '-$200','-$0','$0', '3x3(mile^2)x640x(-$200)= -$115.2 million',
'True_Negative', '$0','$0','$0', '$0',
'False_Positive','-$200','$0','$0', '3x3(mile^2)x640x(-$200)= -$115.2 million',
'False_Negative','$0','$-800','$-1345','3x3(mile^2)x640x(-$2145)= -$1235.5 million' )


equation <- matrix(input, ncol = 5, byrow = TRUE) %>% 
  as_tibble()  

names(equation) <- c('Variable','Prevention Allocation', 'Suppressing Fire Cost', 'Damage Cost', 'Outcome/Grid')

kable(equation) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 7.1 Equation Table\n",
           general_title= '\n')
Variable Prevention Allocation Suppressing Fire Cost Damage Cost Outcome/Grid
True_Positive -$200 -$0 $0 3x3(mile^2)x640x(-$200)= -$115.2 million
True_Negative $0 $0 $0 $0
False_Positive -$200 $0 $0 3x3(mile^2)x640x(-$200)= -$115.2 million
False_Negative $0 $-800 $-1345 3x3(mile^2)x640x(-$2145)= -$1235.5 million

Table 7.1 Equation Table

7.2 Cost-Benefit Table

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1])
                ) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
              ifelse(Variable == "True_Positive", Count *-115.2,
              ifelse(Variable == "True_Negative", Count *0,
              ifelse(Variable == "False_Positive", Count *-115.2,
              ifelse(Variable == "False_Negative", Count *-1235.5,0
                         )))))%>%
    
    bind_cols(data.frame(Description = c(
              "Predicted correctly, fire occured, prevention funding is allocated to prevent fire, no damage",
              "Predicted correctly, no fire, no funding used, no damage",
              "Predicted incorrectly, no fire, prevention funding is allocated",
              "Predicted incorrectly, fire occured, no prevention, caused damge")))


kable(cost_benefit_table)%>%
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed"))%>%
  footnote(general = "Table 7.2 Cost/Benefit Table\n",general_title= '\n')
Variable Count Revenue Description
True_Positive 60 -6912.0 Predicted correctly, fire occured, prevention funding is allocated to prevent fire, no damage
True_Negative 2475 0.0 Predicted correctly, no fire, no funding used, no damage
False_Positive 26 -2995.2 Predicted incorrectly, no fire, prevention funding is allocated
False_Negative 126 -155673.0 Predicted incorrectly, fire occured, no prevention, caused damge

Table 7.2 Cost/Benefit Table

7.3 Optimize Thresholds

The Figure 7.3 indicates that with the increase of the threshold, the cost gradually decrease and counts increases as well. As we discussed before, reducing the damage cost at price of efficient allocation of funding is the ultimate goal of government. Considering the trade-off in order to minimize the cost, we choose 0.1 as the optimized threshold, after optimization, the sensitivity improves to 0.79 from original 0.32 when the threshold is 0.5.

whichThreshold =
  iterateThresholds(
     data=testProbs, observedClass = Outcome, predictedProbs = Probs)

whichThreshold =
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
             case_when(Variable == "Count_TP"  ~ Count * -115.2,
                       Variable == "Count_TN"  ~ 0,
                       Variable == "Count_FP"  ~ Count * -115.2,
                       Variable == "Count_FN"  ~ Count * -1235.5
                       ))
whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point(alpha=.5) +
  scale_colour_manual(values = palette5[c(1:5)]) +    
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue",
       caption = "Figure 7.3") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

testProbs2 <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.1, 1, 0)))

caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")

Confusion Matrix and Statistics

      Reference

Prediction 0 1 0 2157 39 1 344 147

           Accuracy : 0.8575             
             95% CI : (0.8437, 0.8705)   
No Information Rate : 0.9308             
P-Value [Acc > NIR] : 1                  
                                         
              Kappa : 0.3711             
                                         

Mcnemar’s Test P-Value : <0.0000000000000002

        Sensitivity : 0.79032            
        Specificity : 0.86246            
     Pos Pred Value : 0.29939            
     Neg Pred Value : 0.98224            
         Prevalence : 0.06922            
     Detection Rate : 0.05471            

Detection Prevalence : 0.18273
Balanced Accuracy : 0.82639

   'Positive' Class : 1                  
                                         

VIII Conclusion.

In sum, we built the Logistic Regression model to predict probability occurrence rate of wildfires in part of Northern California in order to help California Department of Forestry and Fire Protection reduce wildfire risk and support prevention funding application. The AUC of 0.9 indicates that the model has a high goodness of fit. The sensitivity of the model is 0.32 after cross validation on 100 k-folds while the specificity is close to 1. This model generalizes well concerning specificity. But the sensitivity should be improved.

To improve the accuracy of the model, we need more data samples of historical wildfire incidents instead of the wildfires occurred in year 2020. Moreover, it is important to include more potential factors in the model, such as drought conditions and physical barriers, like state higtway, railroad and water streams.

Despite the limitations, we still believe this project is beneficial to Department of Forestry and Fire Protection and citizens. Using our app, the department can not only get a fire risk prediction(Figure 8), but also have a cost estimate of optimal choice when it comes to wildfire management.

testProbs_all <- data.frame(Outcome = as.factor(engineeredData$fire),
                        Probs = predict(fireModel, engineeredData, type= "response"),
                        uniqueID= engineeredData$uniqueID)

testProbs_all <- testProbs_all %>% mutate(Risk_Cat=
                                            case_when(Probs <=.03 ~ "Low",
                                                   Probs > 0.03 & Probs < .2 ~ "Medium",
                                                   Probs >=.2 ~ "High" ))

fishnet_clipped_probs <- left_join(engineeredData, testProbs_all, on = "uniqueID")

palette3 <- c("#B81D13","#008450","#EFB700")

ggplot() +
  geom_sf(data=fishnet_clipped_probs, aes(fill=Risk_Cat), color="transparent")+
  scale_fill_manual(values=palette3, name="Risk Score") +
  labs(title="Predicted Risk Scores Across 25 Northern California Counties",
       caption = "Figure 8")+
  mapTheme()